1.  INTRODUCTION


There is an increasing need to maintain a Common Operational Picture (COP) among a collection of hosts within a disconnected, intermittent, and low-bandwidth (DIL) Navy maritime environment. We propose the use of set reconciliation algorithms to help address this issue. Set reconciliation algorithms possess the following properties, which make them promising foundations for new technology:

1.  They  require  nearly  optimal  communication  overhead. 

2.  They  only  use  a  single  round  of  communication. 

3.  Many  implementations  possess  low  computational  complexity. 

As a first step, this technical document surveys the state of the art for the set reconciliation problem.


The set reconciliation problem is the following. Suppose Host A and Host B each have a set of length-$b$ binary strings, denoted $S_A$ and $S_B$ respectively. In practice, the length-$b$ binary strings may be the output of a hash function computed on the discrete data elements that constitute a Navy Command and Control (C2) data store. The problem is to determine which information must be sent between Host A and Host B so that each host can compute $S_A \cup S_B$ after a single round of communication. In other words, the protocols discussed in this document allow one exchange between A and B, after which both hosts can compute $S_A \cup S_B$. The goal of this survey is to evaluate existing set reconciliation algorithms in terms of the total amount of information exchanged and their computational complexity. We partition the existing methods into three general classes: (1) methods based on error-correcting codes, (2) methods based on polynomial interpolation, and (3) methods based on Bloom filters.

This document is organized as follows. In Section 2, we discuss algorithms for set reconciliation based upon error-correcting codes, inspired by the works [1], [8], and [11]. Section 3 describes a method for set reconciliation that leverages polynomial interpolation as in [10]. In Section 4, we describe an algorithm for set reconciliation that uses Bloom filter structures [5], [6], [7]. Lastly, in Section 5, we conclude this survey by summarizing ongoing work and identifying potential directions for future research.


2.  ERROR-CORRECTING  CODES 

In this section, we describe an algorithm for set reconciliation that uses error-correcting codes directly. The primary advantage of this approach is that it provides nearly optimal communication overhead. The principal drawback is that its computational complexity is high. The basic idea is to represent collections of $b$-strings as binary vectors of length $2^b$.

For a positive integer $m$, let $B_m : \{0, 1, 2, \ldots, 2^m - 1\} \to \mathbb{Z}_2^m$ be the function that outputs the $m$-bit binary representation of its integer input. Thus, $B_3(2) = (0, 1, 0)$ and $B_3(1) = (0, 0, 1)$, for instance. Clearly, $B_m$ is invertible. Let $M : \mathbb{Z}_2^b \to \mathbb{Z}_2^{2^b}$. Then, for any input $x$, $M(x) = (y_0, y_1, \ldots, y_{2^b - 1})$ is a vector with $y_{B_b^{-1}(x)} = 1$ and $y_i = 0$ otherwise. Under this representation, we represent the sets $S_L$ ($L \in \{A, B\}$) by vectors $v_L$ ($L \in \{A, B\}$), where

$$v_L = \sum_{x \in S_L} M(x).$$
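The mapping from sets to vectors can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `to_bits`, `from_bits`, and `indicator`, and the toy value $b = 3$, are hypothetical and not part of the surveyed works.

```python
B_LEN = 3  # string length b (toy value chosen for illustration)

def to_bits(m, n):
    """B_m: the m-bit binary representation of n, most significant bit first."""
    return tuple((n >> (m - 1 - i)) & 1 for i in range(m))

def from_bits(bits):
    """Inverse of B_m: the integer whose binary representation is `bits`."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def M(x):
    """Length-2^b vector with a single 1 at position B_b^{-1}(x)."""
    vec = [0] * (2 ** len(x))
    vec[from_bits(x)] = 1
    return vec

def indicator(S):
    """v_L: the sum over Z_2 of M(x) for every x in S."""
    v = [0] * (2 ** B_LEN)
    for x in S:
        v = [a ^ b for a, b in zip(v, M(x))]
    return v

S_A = {to_bits(B_LEN, 2), to_bits(B_LEN, 5)}
v_A = indicator(S_A)  # ones at positions 2 and 5
```

Even in this toy setting the vector has $2^b$ entries, which is the source of the cost discussed for this approach.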


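Suppose $H$ is a binary parity check matrix with $2^b$ columns for a code with minimum distance $2d + 1$ (that is, a code that corrects up to $d$ errors), and suppose $|(S_A \setminus S_B) \cup (S_B \setminus S_A)| \le d$. Then, using $H$, we compute

$$h_L = H \cdot v_L, \qquad (1)$$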
for $L \in \{A, B\}$. Equation (1) represents the encoding that takes place on both Host A and Host B. Notice that one potential disadvantage of this setup is that the length of $v_A, v_B$ is exponential in the parameter $b$, so these vectors could be quite large, making the matrix multiplication in Equation (1) an expensive operation.

The next step is for Host A to send $h_A$ to Host B and for Host B to send $h_B$ to Host A. In the following, we focus on the computation of $S_A \cup S_B$ on Host B. The logic on Host A is identical. Since Host B now has $h_A, h_B$, Host B can compute


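$$h_A + h_B = H \cdot v_A + H \cdot v_B = H \cdot \left( \sum_{x \in S_A} M(x) + \sum_{x \in S_B} M(x) \right) = H \cdot \sum_{x \in (S_A \setminus S_B) \cup (S_B \setminus S_A)} M(x) = H \cdot M(e),$$

where $M(e) = \sum_{x \in (S_A \setminus S_B) \cup (S_B \setminus S_A)} M(x)$. The third equality holds because the sums are over $\mathbb{Z}_2$, so every element of $S_A \cap S_B$ contributes twice and cancels. For a vector $v \in \mathbb{Z}_2^{2^b}$, let $wt(v)$ denote the number of non-zero elements in $v$. By assumption, $wt(M(e)) \le d$ and, since $H$ is the parity check matrix for a code with minimum distance $2d + 1$, we can uniquely determine $M(e)$, from which the set $(S_A \setminus S_B) \cup (S_B \setminus S_A)$ can be reconstructed.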
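The exchange described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction used in the cited works: it takes $b = 4$ and $d = 2$, builds $H$ with one column $(1, x, x^3)$ per $x \in GF(2^4)$ (an extended double-error-correcting BCH-style code), and decodes by exhaustive search, which is exponential in $b$ and only workable at toy sizes. All names (`gf_mul`, `column`, `syndrome`, `decode`) are hypothetical. The syndrome is accumulated column by column, which is equivalent to forming $v_L$ and computing $H \cdot v_L$.

```python
from itertools import combinations

B = 4            # string length b (toy value)
D = 2            # bound d on the size of the symmetric difference
POLY = 0b10011   # x^4 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^B): carry-less multiply, reduce by POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> B:
            a ^= POLY
        b >>= 1
    return r

def column(x):
    """Column of H indexed by x: the bits of (1, x, x^3) packed into one int."""
    x3 = gf_mul(x, gf_mul(x, x))
    return (1 << (2 * B)) | (x << B) | x3

def syndrome(s):
    """h = H * v over Z_2, where v is the indicator vector of the collection s."""
    h = 0
    for x in s:
        h ^= column(x)
    return h

def decode(h):
    """Exhaustive search for the pattern of weight <= D whose syndrome is h."""
    for w in range(D + 1):
        for e in combinations(range(2 ** B), w):
            if syndrome(e) == h:
                return set(e)
    raise ValueError("symmetric difference exceeds D")

S_A = {1, 4, 7, 9}
S_B = {1, 4, 9, 12}
h_A, h_B = syndrome(S_A), syndrome(S_B)   # the only data exchanged
diff = decode(h_A ^ h_B)                  # computed on Host B
union_on_B = S_B | diff
```

Each host transmits a single syndrome of $2b + 1 = 9$ bits in this sketch, while the decoder enumerates candidate error patterns, which reflects the trade-off between communication and computation noted in this section.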

Let $E(b, d)$ denote the total amount of information exchange. It can be shown [12] that the amount of information that must be transmitted from Host A to Host B (and likewise from Host B to Host A) satisfies

$$E(b, d) \le db.$$

The  following  theorem  summarizes  the  discussion  from  this  section. 

Theorem 1. There exists a set reconciliation protocol that requires $db$ bits of information exchange with encoding complexity $O(d \cdot 2^b)$ and decoding complexity $O(d \cdot 2^b)$.


3.  POLYNOMIAL  INTERPOLATION 

In this section, we describe the set reconciliation approach from [10] that leverages polynomial interpolation. We first describe the approach along with some of its properties. To describe the polynomial interpolation method, we first introduce the concept of the characteristic polynomial ([2], [9]), which will serve to represent the sets $S_A, S_B$ on Host A and Host B respectively. For a set $S = \{x_1, x_2, \ldots, x_n\} \subset GF(q)$ where $q > 2^b$, we define the characteristic polynomial of $S$ as

$$\chi_S(z) = (z - x_1)(z - x_2) \cdots (z - x_n).$$

Host A and Host B first agree upon $d$ evaluation points, denoted $\{P_1, P_2, \ldots, P_d\} \subset GF(q)$, and we assume $|(S_A \setminus S_B) \cup (S_B \setminus S_A)| \le d$. Host A transmits the result of evaluating $\chi_{S_A}(z)$ at each of the $d$ evaluation points along with the number $|S_A|$ to Host B. Host B transmits the result of evaluating $\chi_{S_B}(z)$ at each of the $d$ evaluation points along with $|S_B|$ to Host A. As before, we describe the process of determining $S_A \cup S_B$ on Host B since the process on Host A is the same.

At this point, Host B has the following information:

1.  $|S_A|$, $|S_B|$, and

2.  $\{\chi_{S_A}(P_1), \chi_{S_A}(P_2), \ldots, \chi_{S_A}(P_d)\}$, $\{\chi_{S_B}(P_1), \chi_{S_B}(P_2), \ldots, \chi_{S_B}(P_d)\}$.

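On Host B, we seek to recover the rational function

$$\frac{\chi_{S_A}(z)}{\chi_{S_B}(z)} = \frac{\chi_{S_A \setminus S_B}(z)}{\chi_{S_B \setminus S_A}(z)},$$

where the factors corresponding to $S_A \cap S_B$ cancel, and from which the set $S_A \cup S_B$ can be easily determined. Without loss of generality, suppose $|S_A| \ge |S_B|$ and denote $d_A = |S_A \setminus S_B|$, $d_B = |S_B \setminus S_A|$. Let $\delta = |S_A| - |S_B|$. Since $d_A - d_B = \delta$ and $d_A + d_B \le d$, we have $d_A \le (d + \delta)/2$ and $d_B \le (d - \delta)/2$. Denote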

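$$\frac{\chi_{S_A \setminus S_B}(z)}{\chi_{S_B \setminus S_A}(z)} = \frac{q(z)}{r(z)} = \frac{z^{d_A} + q_{d_A - 1} z^{d_A - 1} + \cdots + q_0}{z^{d_B} + r_{d_B - 1} z^{d_B - 1} + \cdots + r_0}.$$

Then, for $i \in \{1, 2, \ldots, d\}$, we have that

$$\frac{\chi_{S_A \setminus S_B}(P_i)}{\chi_{S_B \setminus S_A}(P_i)} = \frac{P_i^{d_A} + q_{d_A - 1} P_i^{d_A - 1} + \cdots + q_0}{P_i^{d_B} + r_{d_B - 1} P_i^{d_B - 1} + \cdots + r_0}. \qquad (2)$$

If $F_i = \frac{\chi_{S_A \setminus S_B}(P_i)}{\chi_{S_B \setminus S_A}(P_i)}$, which Host B can compute as $\chi_{S_A}(P_i) / \chi_{S_B}(P_i)$ from the exchanged evaluations, then for $i \in \{1, 2, \ldots, d\}$ we can rewrite Equation (2) as

$$P_i^{d_A} + q_{d_A - 1} P_i^{d_A - 1} + \cdots + q_0 = F_i \left( P_i^{d_B} + r_{d_B - 1} P_i^{d_B - 1} + \cdots + r_0 \right). \qquad (3)$$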

Since it can be shown that the equations from Equation (3) are linearly independent [13], we can uniquely determine $\frac{\chi_{S_A \setminus S_B}}{\chi_{S_B \setminus S_A}}$ from the previous derivations.
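The recovery step on Host B can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from [10]: the field is the prime field $GF(97)$, the evaluation points lie outside the universe $\{0, \ldots, 2^b - 1\}$ with $b = 4$ so that no evaluation is zero, the bound $d$ equals the true size of the symmetric difference so that the system from Equation (3) is square and nonsingular, and roots are found by exhaustive search. All function names are hypothetical, and `pow(x, -1, FIELD)` requires Python 3.8 or later.

```python
FIELD = 97  # prime field size q (assumption: larger than 2^b + d)

def chi(S, z):
    """Evaluate the characteristic polynomial of S at z over GF(FIELD)."""
    out = 1
    for x in S:
        out = out * (z - x) % FIELD
    return out

def solve_mod(A, y):
    """Solve the square system A u = y over GF(FIELD) by Gauss-Jordan elimination."""
    n = len(y)
    m = [row[:] + [v] for row, v in zip(A, y)]
    for c in range(n):
        piv = next(r for r in range(c, n) if m[r][c] % FIELD)
        m[c], m[piv] = m[piv], m[c]
        inv = pow(m[c][c], -1, FIELD)
        m[c] = [v * inv % FIELD for v in m[c]]
        for r in range(n):
            if r != c and m[r][c]:
                f = m[r][c]
                m[r] = [(vr - f * vc) % FIELD for vr, vc in zip(m[r], m[c])]
    return [m[r][n] for r in range(n)]

def poly_roots(coeffs):
    """Roots in GF(FIELD) of sum_j coeffs[j] * z^j, by exhaustive search."""
    return {z for z in range(FIELD)
            if sum(c * pow(z, j, FIELD) for j, c in enumerate(coeffs)) % FIELD == 0}

def reconcile(evals_A, evals_B, size_A, size_B, points):
    """Recover (S_A minus S_B, S_B minus S_A) from the exchanged evaluations."""
    d = len(points)
    delta = size_A - size_B
    d_A, d_B = (d + delta) // 2, (d - delta) // 2
    rows, rhs = [], []
    for p, a, b in zip(points, evals_A, evals_B):
        F = a * pow(b, -1, FIELD) % FIELD
        # Equation (3) rearranged: sum q_j p^j - F * sum r_j p^j = F p^d_B - p^d_A
        rows.append([pow(p, j, FIELD) for j in range(d_A)]
                    + [-F * pow(p, j, FIELD) % FIELD for j in range(d_B)])
        rhs.append((F * pow(p, d_B, FIELD) - pow(p, d_A, FIELD)) % FIELD)
    sol = solve_mod(rows, rhs)
    q = sol[:d_A] + [1]   # monic numerator, lowest-degree coefficient first
    r = sol[d_A:] + [1]   # monic denominator
    return poly_roots(q), poly_roots(r)

S_A = {1, 4, 7, 9, 13}
S_B = {1, 4, 9, 12}
points = [17, 18, 19]                       # d = 3 agreed evaluation points
evals_A = [chi(S_A, p) for p in points]     # sent by Host A together with |S_A|
evals_B = [chi(S_B, p) for p in points]     # computed locally on Host B
only_A, only_B = reconcile(evals_A, evals_B, len(S_A), len(S_B), points)
union_on_B = S_B | only_A
```

When $d$ is only an upper bound, the system can be underdetermined and the degrees must be handled more carefully; that case is outside the scope of this sketch.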

Notice that the polynomial interpolation method removes the requirement that operations be performed over binary vectors of length $2^b$. However, the encoding procedure does require operations over a field of size $q > 2^b$, so it is unclear whether the polynomial interpolation method provides any meaningful advantage in encoding complexity over the approach that uses error-correcting codes. The communication overhead is the same as for the method from the previous section ($O(db)$), while the decoding complexity is $O(d^3)$. Notice that if $d^2 \ll 2^b$, then the polynomial interpolation method may offer a substantial improvement in decoding complexity.
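To make the comparison concrete, consider illustrative values that are assumptions rather than figures from the cited works, $b = 32$ and $d = 20$, and ignore the constants hidden by the $O(\cdot)$ notation:

$$d \cdot 2^b = 20 \cdot 2^{32} = 85{,}899{,}345{,}920 \approx 8.6 \times 10^{10}, \qquad d^3 = 20^3 = 8000.$$

Here $d^2 = 400 \ll 2^{32}$, so the decoding bound drops by roughly seven orders of magnitude.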

The  next  theorem  summarizes  the  discussion  from  this  section. 

Theorem 2. There exists a set reconciliation protocol that requires $db$ bits of information exchange with decoding complexity $O(d^3)$.


4.  BLOOM  FILTER 

In this section, we consider a slightly different approach from those taken in the previous two sections. Recall that the protocols discussed there guarantee recovery of $S_A \cup S_B$ whenever $|(S_A \setminus S_B) \cup (S_B \setminus S_A)| \le d$. In contrast, the Bloom filter approach allows recovery of $S_A \cup S_B$ with high probability whenever $|(S_A \setminus S_B) \cup (S_B \setminus S_A)| \le d$. For the remainder of this section, we discuss a variation of the protocol from [5], which uses the invertible Bloom filter first discussed in [6]. Similar ideas were also used in [3] and [7].

On each host, a special type of Bloom filter, known as an invertible Bloom filter (IBF), is created. The hosts first agree on two hash functions $H$ and $H_k$ for some positive integer $k$. The IBF is a collection of $n$ cells. Each $b$-string is hashed by $H_k$ into $k$ cells of the IBF. Each cell contains three fields:

1.  idSum: XOR of all $b$-strings that have hashed into the cell.

2.  hashSum: XOR of $H(x)$ over all $b$-strings $x$ that have hashed into the cell.

3.  count: an integer counting the number of times the cell has been hashed to.

The idea is for Host A to compute an IBF, denoted $B_{S_A}$, and for Host B to compute an IBF, denoted $B_{S_B}$. Afterwards, A and B exchange IBFs and, from the IBFs, each host determines $S_A \cup S_B$. We assume that we have hash functions $h_1, \ldots, h_k$ where, for $i \in \{1, \ldots, k\}$, $h_i : \mathbb{Z}_2^b \to \mathbb{Z}_n$. We set $H = h_0$ and $H_k = (h_1, \ldots, h_k)$. The computation of $B_{S_A}$ is detailed in Algorithm 1 with $S = S_A$, and similarly $B_{S_B}$ is computed by Algorithm 1 with $S = S_B$. It is assumed that at the start of the algorithm the IBFs $B_{S_A}$, $B_{S_B}$ are empty. Let $\oplus$ denote the XOR operation, so that $(1, 0, 1, 0) \oplus (1, 1, 0, 0) = (0, 1, 1, 0)$.
Algorithm 1: IBF Encode

input : The set S ⊂ GF(2^b)

output: The IBF B_S

for every x ∈ S do
    for i = 1 : k do
        B_S[h_i(x)].idSum = B_S[h_i(x)].idSum ⊕ x;
        B_S[h_i(x)].hashSum = B_S[h_i(x)].hashSum ⊕ H(x);
        B_S[h_i(x)].count = B_S[h_i(x)].count + 1;
    end
end


After Host A computes $B_{S_A}$ and Host B computes $B_{S_B}$, Host A sends $B_{S_A}$ to Host B and similarly Host B sends $B_{S_B}$ to Host A. The decoding procedure detailed in Algorithm 2 is performed on both Host A and Host B. The operation $B_{S_B}.j - B_{S_A}.j$ returns an array whose $i$-th cell is equal to $B_{S_B}[i].j - B_{S_A}[i].j$ for $i \in \{1, 2, \ldots, n\}$, where $j$ is one of the three fields idSum, hashSum, count. Similarly, the operation $B_{S_B}.j \oplus B_{S_A}.j$ returns an array whose $i$-th cell is equal to $B_{S_B}[i].j \oplus B_{S_A}[i].j$ for $i \in \{1, 2, \ldots, n\}$, where $j$ is one of the three fields idSum, hashSum, count.

Algorithm 2: IBF Decode

input : B_{S_A}, B_{S_B}

output: An estimate T of (S_A \ S_B) ∪ (S_B \ S_A)

T = ∅;
ℓ = 1;
B.idSum = B_{S_B}.idSum ⊕ B_{S_A}.idSum;
B.hashSum = B_{S_B}.hashSum ⊕ B_{S_A}.hashSum;
B.count = B_{S_B}.count − B_{S_A}.count;
while ℓ ≤ n do
    if B[ℓ].count = ±1 ∧ B[ℓ].hashSum = H(B[ℓ].idSum) then
        x = B[ℓ].idSum;
        c = B[ℓ].count;
        T = T ∪ {x};
        for i = 1 : k do
            B[h_i(x)].idSum = B[h_i(x)].idSum ⊕ x;
            B[h_i(x)].hashSum = B[h_i(x)].hashSum ⊕ H(x);
            B[h_i(x)].count = B[h_i(x)].count − c;
        end
        ℓ = 0;
    end
    ℓ = ℓ + 1;
end
If B.count does not contain all zeros, then a decoding error has occurred.
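Algorithms 1 and 2 can be sketched together in Python. This is an illustration under assumptions rather than the construction from [5] or [6]: elements are small non-negative integers, both the cell hashes and the check hash $H$ are derived from SHA-256 via `hashlib`, and the $n$ cells are split into $k$ equal blocks so that each element lands in $k$ distinct cells (a partitioning choice made here for simplicity). The names `cells`, `encode`, and `decode` are hypothetical.

```python
import hashlib

K = 3    # number of cell-index hash functions h_1, ..., h_k
N = 30   # number of cells n (a multiple of K in this sketch)

def _h(tag, x):
    """Deterministic 64-bit hash of x, keyed by `tag` (assumed construction)."""
    digest = hashlib.sha256(f"{tag}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def H(x):
    """Check hash stored in hashSum."""
    return _h("check", x)

def cells(x):
    """The K cells that x hashes to, one per block of N // K cells."""
    block = N // K
    return [i * block + _h(i, x) % block for i in range(K)]

def encode(S):
    """Algorithm 1: build an IBF as a list of [idSum, hashSum, count] cells."""
    ibf = [[0, 0, 0] for _ in range(N)]
    for x in S:
        for c in cells(x):
            ibf[c][0] ^= x
            ibf[c][1] ^= H(x)
            ibf[c][2] += 1
    return ibf

def decode(ibf_A, ibf_B):
    """Algorithm 2: peel pure cells from the difference of the two IBFs."""
    B = [[b[0] ^ a[0], b[1] ^ a[1], b[2] - a[2]] for a, b in zip(ibf_A, ibf_B)]
    only_A, only_B = set(), set()
    progress = True
    while progress:
        progress = False
        for cell in B:
            if cell[2] in (1, -1) and cell[1] == H(cell[0]):
                x, sign = cell[0], cell[2]   # saved before the cell is modified
                (only_B if sign == 1 else only_A).add(x)
                for c in cells(x):
                    B[c][0] ^= x
                    B[c][1] ^= H(x)
                    B[c][2] -= sign
                progress = True
    ok = all(cell == [0, 0, 0] for cell in B)  # False signals a decoding error
    return only_A, only_B, ok

S_A = {3, 17, 42, 99, 250}
S_B = {3, 17, 42, 123, 200}
only_A, only_B, ok = decode(encode(S_A), encode(S_B))
```

The sign of count distinguishes elements held only by Host B (+1) from those held only by Host A (−1); their union is the estimate $T$ of Algorithm 2. Decoding is probabilistic, so callers should check `ok` before trusting the result.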


When the number of cells in the IBF satisfies $n > d(k + 1)$, we have the following theorem, which follows in a straightforward manner from the ideas in [5].

Theorem 3. There exists a set reconciliation protocol that requires $O(db)$ bits of information exchange with encoding complexity $O(\max\{|S_A|, |S_B|\})$ and decoding complexity $O(d)$ that computes $S_A \cup S_B$ with probability at least $1 - O(d^{-k})$ when $|(S_A \setminus S_B) \cup (S_B \setminus S_A)| \le d$.
5.  APPLICATIONS  AND  FUTURE  WORK


In  this  section,  we  comment  on  one  possible  Navy  system,  known  as  Maritime  Tactical  Command  and  Control 
(MTC2),  which  could  benefit  from  the  use  of  set  reconciliation  algorithms.  Afterwards,  we  consider  directions  for 
future  work. 

MTC2  is  the  follow-on  to  Global  Command  and  Control  System-Maritime  (GCCS-M)  and  provides  capabilities 
that  include  situational  readiness  and  planning  for  Navy  Tactical  environments.  One  of  the  core  components  of  MTC2 
is  the  data  layer  which  abstracts  the  implementation  of  the  underlying  data  store  from  MTC2  applications.  In  particular, 
the  MTC2  data  layer  provides  a  RESTful  interface  to  a  schemaless  database.  All  documents  stored  within  the  data 
layer  have  the  format  shown  in  Figure  1 . 


document structure

    {"_id" : { "$oid" : "<mongo uuid>"},
     "@context" : {
         "mtc2": "https://mil.navy.mtc2/1.0",
         "@base": null,
         "@vocab": "mtc2:"
     },
     "@id" : "<java uuid>",
     "@type" : ["<some type>"],
     "versionVector": {<version vector>},  // Version Vector - see below
     "createdDate": <date>,
     "createdBy": <userId>,
     "metadata": [{"name":"<some name>","value":"<some value>"}, ...],  // Application defined
     "data" : {<Current Document>},
     "isDeleted" : <true | false>,
     "isHistory" : <true | false>,
    }


Figure  1.  Document  Structure. 

Suppose  that  two  data  stores,  denoted  Host  A  and  Host  B,  within  MTC2  want  to  synchronize  their  information. 
One  simple  approach  under  this  setup  could  be  the  following.  First,  suppose  a  hash  field  is  added  to  each  document 
(where  the  hash  is  computed  using  the  data  content  within  each  document).  Host  A  determines  which  hashes  it  has 
that  Host  B  does  not  have  and  similarly  Host  B  determines  the  hashes  it  has  that  Host  A  does  not  have.  As  mentioned 
previously,  if  a  set  reconciliation  algorithm  is  used  to  determine  the  hash  differences,  then  only  a  single  round  of 
communication  is  required  and  the  amount  of  information  exchanged  between  Host  A  and  Host  B  is  nearly  optimal. 
Host  A  then  sends  to  Host  B  the  documents  corresponding  to  the  hashes  Host  A  has  that  Host  B  does  not  and  similarly 
Host  B  sends  a  set  of  documents  to  Host  A. 

As a concrete example of the benefit of using these algorithms, suppose Host A has a set of documents that are each 1 KB in size, and similarly Host B has a set of documents each of size 1 KB. Furthermore, suppose we use a 32-bit Cyclic Redundancy Check (CRC32) hash so that the size of each hash is 32 bits. Suppose Host A has 10 documents that Host B does not have, and similarly Host B has 10 documents that Host A does not have. Then the approach described in the previous paragraph would require the transmission of 164480 bits: 163840 bits for the 20 differing documents plus 640 bits for reconciling the hashes. If Host A knew all the documents Host B had and Host B knew all the documents Host A had, then 163840 bits of information would be exchanged, so under this setup our approach would be nearly optimal in terms of the total amount of information exchanged. Furthermore, notice that only two rounds of communication would be required (one round for the hashes and another to transmit the documents). Such approaches may be suitable for degraded naval networks where bandwidth may be a scarce resource.
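The figures in this example can be reproduced with a short calculation. The sketch assumes that 1 KB means 1024 bytes and that hash reconciliation is charged at $db$ bits with $d = 20$ and $b = 32$, which matches the totals quoted above; the variable names are illustrative.

```python
DOC_BYTES = 1024   # assumption: 1 KB = 1024 bytes
HASH_BITS = 32     # b: size of a CRC32 hash in bits
ONLY_A = 10        # documents Host A has that Host B lacks
ONLY_B = 10        # documents Host B has that Host A lacks

d = ONLY_A + ONLY_B                     # size of the symmetric difference
document_bits = d * DOC_BYTES * 8       # cost of shipping the missing documents
hash_bits = d * HASH_BITS               # d * b bits for hash reconciliation
total_bits = document_bits + hash_bits
overhead = hash_bits / document_bits    # relative cost of reconciliation

print(document_bits, total_bits, f"{overhead:.2%}")
```

Under these assumptions the reconciliation step adds well under one percent to the cost of sending the documents themselves.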

Some  directions  for  future  work  include  the  following: 

1.  Set reconciliation algorithms for data elements that are related.

2.  New reconciliation algorithms with nearly optimal information exchange that possess less computational complexity than polynomial interpolation.

3.  Algorithms with security constraints.


The first item above is motivated by the setup where a host makes minor changes to a document between synchronization rounds. For instance, suppose Host A and Host B initially have a single, identical track. Suppose Host A updates the track location information and leaves the rest of the track unchanged. We would like to have an algorithm for synchronizing A and B that transmits only the updates that Host A made to the track rather than the entire track. Such a method has the potential to further reduce the required throughput of existing algorithms. Some preliminary work towards the development of such algorithms has started in [4].

Recall that the approach outlined in Section 4 reduced the computational complexity but increased the communication overhead by a constant factor. This constant factor of additional communication may be prohibitively expensive in DIL environments. Consequently, the second direction enumerated above proposes to reduce the communication overhead required by approaches that use Bloom filters, at the cost of potentially increasing the computational complexity of the algorithm from $O(d)$ to $O(d \log d)$.

The  third  item  identified  for  future  work  refers  to  the  scenario  where  a  collection  of  hosts  communicating  together 
have  different  security  privileges.  Therefore,  it  may  be  desirable  to  leverage  the  structure  of  the  set  reconciliation 
transmission  schemes  to  enhance  the  privacy  of  the  information  exchanged. 